The course is an open data science course for PhD students at the University of Helsinki. I'm especially expecting to learn some more R and R Markdown, and also to use GitHub.
The link to my GitHub repository is https://github.com/akarimo/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Nov 26 21:18:57 2020"
The text continues here.
In this chapter I report what I did for the regression analysis assignment. I used the JYTOPKYS3 survey data collected by Kimmo Vehkalahti between 3.12.2014 and 10.1.2015. Below you can see the structure and dimensions of the data. It contains information on the gender and age of the respondents, as well as on their learning skills, attitudes, and exam points in a statistics course exam. All variables besides "gender", which is dichotomous, are continuous. The variables "attitude", "deep", "stra", and "surf" are averages of ten items measured on a Likert scale. The analyzed version of the data includes 166 respondents, because respondents who did not attend the exam were removed.
#read the data
learning2014 <- read.csv("~/Desktop/IODS/R project for ODS/IODS-project/data/learning2014.csv", row.names = 1)
#checking the structure and dimensions of the dataset
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
In the plot below you can see scatterplot matrices, distributions, and correlations between the variables by gender. Information on females is coloured pink and information on males light blue. There are considerably more females than males in the dataset, and most respondents are under 40 years old. The box plots and distributions of the variables by gender otherwise look quite similar, but the attitude scores of females show more variance and are in general somewhat lower than those of males. There is also a larger share of males with low scores on the variable "surf" compared to females.
The correlations between the variables are in general quite low and mostly not statistically significant, and the scatterplots show that the relationships between these variables look mostly quite random. The variable "surf" is significantly negatively correlated with both "attitude" and "deep", but when examined by gender this correlation is significant only for males. "surf" is also significantly negatively correlated with "stra" for all respondents, but not within either gender separately. The variables "attitude" and "Points" are significantly positively correlated with each other for both males and females.
#loading required packages
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# creating a plot of the data
plot_learning2014 <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
plot_learning2014
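The significance of an individual correlation can be checked with cor.test(). A minimal sketch using R's built-in mtcars data (not the course data):

```r
# cor.test() returns the correlation estimate and a p-value for
# the null hypothesis of zero correlation
ct <- cor.test(mtcars$wt, mtcars$mpg)
round(unname(ct$estimate), 2)  # about -0.87
ct$p.value < 0.05             # TRUE: significant at the 5% level
```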
After studying the relationships of these variables graphically and through correlations, it seems that the "attitude" variable might explain the variation in exam points best. It also seems possible that "stra" and "surf" would be significantly associated with "Points" in a more advanced model. Below I present a summary of a multiple linear regression model explaining exam points with attitude, strategic learning ("stra"), and surface learning ("surf"). Based on visual inspection, all of these variables are approximately normally distributed. However, the scatterplots of "Points" against "stra" and "Points" against "surf" do not indicate a clear linear relationship.
# creating a regression model with attitude, strategic learning, and
# surface learning as explanatory variables
model1 <- lm(Points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(model1)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In a model summary, significant parameter estimates are typically marked with stars. As the summary above shows, attitude is indeed significantly positively associated with exam points, but strategic learning and surface learning are not. The columns "t value" and "Pr(>|t|)" in the "Coefficients" table give the t-value and p-value of each parameter. These values correspond to a statistical test of the null hypothesis that the true value of the parameter is zero. By convention, p-values under 0.05 indicate a statistically significant parameter estimate. Here the p-value of the "attitude" parameter is very low, but for "stra" and "surf" it is nowhere close to 0.05.
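The "Coefficients" table can also be accessed programmatically with coef(summary(fit)). A minimal sketch on R's built-in cars data (not the course data):

```r
# coef(summary()) returns the coefficient matrix with columns
# Estimate, Std. Error, t value, and Pr(>|t|)
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))
# which parameter estimates are significant at the 5% level?
coefs[, "Pr(>|t|)"] < 0.05
```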
Because "stra" and "surf" are not significantly associated with "Points", I remove them from the model. You can see the results of the new model below.
#remove stra and surf and run model again
model2 <- lm(Points ~ attitude, data = learning2014)
# print out a summary of the model
summary(model2)
##
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the model summary above you can see that in this simple linear model the parameter estimate for "attitude" is slightly higher than in the previous model, and the "Residuals" table shows a residual distribution very similar to the previous model. The model fit can be further assessed with the multiple R-squared value at the bottom of the summary. Here it is about 0.19, indicating that the model explains 19 percent of the variance in the dependent variable. In this simple linear regression, this means that differences in attitude explain about a fifth of the variance in exam points.
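The multiple R-squared can be reproduced by hand as 1 - RSS/TSS, the residual sum of squares divided by the total sum of squares. A self-contained check using R's built-in cars data (not the course data):

```r
# R-squared = 1 - sum(residuals^2) / sum((y - mean(y))^2)
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2 <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```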
To analyze how well this model meets the assumptions of linear regression, I produce some model diagnostics below. The assumptions of linear regression include that the errors have constant variance (i.e. the size of the errors does not depend on the explanatory variables), that the errors are normally distributed, and that the errors are uncorrelated. The "Residuals vs Fitted" plot in the top left corner shows that the relationship between the residuals and the fitted values is quite random, which indicates that the size of the errors does not depend on the explanatory variable. The "Normal Q-Q" plot shows that the errors are reasonably normally distributed, satisfying the normality assumption, and the "Residuals vs Leverage" plot implies that no single observation has an unusually high impact on the model. These diagnostics imply a reasonably good fit to the data, so we can trust the results.
# draw diagnostic plots
par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))
date()
## [1] "Thu Nov 26 21:19:09 2020"
Here we go again...
In this chapter I report the logistic regression exercise. I use data from the UCI Machine Learning Repository. The data contains information on student grades and on demographic, social, and school-related features of students in two Portuguese schools. Below is a list of the attributes in the data.
#read the data
alc <- read.csv("~/Desktop/IODS/R project for ODS/IODS-project/data/alc.csv")
#print names of variables
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
I'm studying the effect of sex, age, number of school absences, and final grade (G3) on high alcohol use. My hypotheses are:
Hypothesis 1: male students have a higher risk of high alcohol use than female students.
Hypothesis 2: the older the student, the higher the risk of high alcohol use.
Hypothesis 3: the more school absences, the higher the risk of high alcohol use.
Hypothesis 4: the lower the grade, the higher the risk of high alcohol use.
library(ggplot2)
#barplot sex and high use
ggplot(data = alc, aes(x = sex, fill = high_use)) + geom_bar()
#barplot age and high use
ggplot(data = alc, aes(x = age, fill = high_use)) + geom_bar()
#boxplot absences and high use
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot()
#cross-tabulation grades and high use
table(highuse = alc$high_use, grades = alc$G3)
## grades
## highuse 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## FALSE 1 0 1 6 4 15 4 19 3 38 16 54 19 31 15 30 5 7
## TRUE 1 1 0 1 2 7 2 8 4 26 10 27 7 7 1 8 2 0
From the plots above we can see that high use does seem more common among male students than among female students, while the relationship between high use and age is not clear based on the bar plot. Based on the box plot, high use is indeed to some extent related to a higher number of absences, and based on the cross-tabulation, students with higher grades seem less likely to belong to the group of high users.
Next I fit the logistic regression model. From the model summary below we can see that sex and absences are significantly associated with high alcohol use. The odds ratios in the table below show that the odds of high use are approximately 2.68 times higher for males than for females. Also, a one-unit increase in absences multiplies the odds of high use by approximately 1.09. Neither age nor grade (G3) is statistically significantly associated with high alcohol use.
# fit the model
model1 <- glm(high_use ~ sex + age + absences + G3, data = alc, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = high_use ~ sex + age + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2729 -0.8326 -0.6292 1.0442 2.1190
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.38339 1.84796 -1.831 0.067120 .
## sexM 0.98580 0.24184 4.076 4.58e-05 ***
## age 0.13985 0.10328 1.354 0.175690
## absences 0.08964 0.02309 3.882 0.000104 ***
## G3 -0.06631 0.03624 -1.830 0.067277 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.66 on 377 degrees of freedom
## AIC: 433.66
##
## Number of Fisher Scoring iterations: 4
library(tidyr); library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# compute odds ratios (OR)
OR <- coef(model1) %>% exp
# compute confidence intervals (CI)
CI <- confint(model1) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03393229 0.0008683319 1.237502
## sexM 2.67996134 1.6781593816 4.338864
## age 1.15010522 0.9401979286 1.410759
## absences 1.09377649 1.0474354900 1.146836
## G3 0.93584524 0.8710385897 1.004426
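Each odds ratio above is simply the exponential of the corresponding log-odds coefficient; for example, using the "absences" estimate printed in the model summary:

```r
# exponentiating a log-odds coefficient gives its odds ratio
exp(0.08964)  # about 1.094, matching the OR table above
```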
Next, we look at the predictive power of the model with only sex and absences as predictors.
library(dplyr)
# fit a new model with only significant predictors
model_pred <- glm(high_use ~ sex + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(model_pred, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 88 26
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
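The predictive power can be summarized as a misclassification rate computed from the confusion matrix; the counts below are copied from the cross-tabulation above:

```r
# rows = actual high_use, columns = predicted high_use
conf <- matrix(c(258, 10,
                 88, 26), nrow = 2, byrow = TRUE)
error_rate <- 1 - sum(diag(conf)) / sum(conf)
round(error_rate, 2)  # about 0.26
```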
First, I load the data and look at its structure and dimensions. The dataset includes the per capita crime rate by town (crim) and various variables regarding the residents, homes, and businesses in these towns. The data has 506 observations and 14 variables.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the structure and dimensions of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
In the plot below we can see how these variables correlate with each other. Negative correlations are marked with red and positive correlations with blue. There are quite strong positive correlations, for example between the proportion of non-retail business acres per town (indus) and nitrogen oxides concentration (nox), as well as between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax). There are strong negative correlations between the weighted mean of distances to five Boston employment centres (dis) and indus, nox, and the proportion of owner-occupied units built prior to 1940 (age).
Summaries of these variables show that their scales vary a lot. Some variables range between 0 and 1 (e.g. the dummy variable chas), while tax ranges from 187 to 711.
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
#save rounded matrix
cor_matrix <- cor_matrix %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
#summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
To make the variables more comparable I standardize them. Below we can see that the scaled variables are now centered around zero.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
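The standardization performed by scale() subtracts each column's mean and divides by its standard deviation; a quick self-contained check on a toy vector (not the Boston data):

```r
# scale() computes (x - mean(x)) / sd(x) column-wise
x <- c(1, 2, 3, 4, 5)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```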
Next, I create a categorical variable of the crime rate.
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
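The quantile-based cut can be sanity-checked on a toy vector, where each of the four categories should receive a quarter of the observations (a sketch, not the Boston data):

```r
# quantile() gives the min, quartiles, and max as break points,
# so cut() produces four equal-sized groups here
x <- 1:100
q <- cut(x, breaks = quantile(x), include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(q)  # 25 observations in each category
```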
Then I divide the dataset into training and test sets, so that 80% of the data belongs to the training set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Next, I fit a linear discriminant analysis on the training data. The plot below shows the results of the analysis, with the observations coloured according to the categorical crime rate.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow.heads = 0.1, color = 'purple', tex = 0.75, choices = c(1,2)) {
heads = coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale*heads[,choices[1]],
y1 = myscale*heads[,choices[2]],
col = color, length = arrow.heads)
text(myscale*heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes = as.numeric(train$crime)
#plot the result
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 2)
I save the correct crime rate classes from the test data and then remove the crime variable from the test set.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Then I predict the classes with the test data. In the table below we see that the "high" category is predicted perfectly, but more errors are made in the other categories. For example, of the observations in the "med_high" category, only 13 are predicted into the correct category, while 9 are predicted into "med_low" and 1 into "high".
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 11 1 0
## med_low 1 20 6 0
## med_high 0 9 13 1
## high 0 0 0 28
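The overall classification accuracy can be read off the cross-tabulation by dividing the diagonal by the total; the counts below are copied from the table above (they will vary with the random train/test split):

```r
# rows = correct classes, columns = predicted classes
tab <- matrix(c(12, 11,  1,  0,
                 1, 20,  6,  0,
                 0,  9, 13,  1,
                 0,  0,  0, 28), nrow = 4, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 2)  # 0.72
```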
Next, I reload the dataset, scale the variables again, and look at the Euclidean distances between observations.
# reload the data
data(Boston)
# center and standardize variables
boston_scaled <- scale(Boston)
boston_scaled <- data.frame(boston_scaled)
# euclidean distance matrix
eu_dist <- dist(boston_scaled)
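By default dist() computes Euclidean distances between the rows of the data; a minimal check on two toy points:

```r
# the distance between (0, 0) and (3, 4) should be 5
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))  # 5
```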
Before running the k-means algorithm, I investigate the optimal number of clusters. In the plot below we see that the optimal number of clusters is 2, because that is where the total within-cluster sum of squares (WCSS) drops most radically.
set.seed(123) #set seed to ensure results are replicable
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
Next I run the k-means algorithm with 2 clusters and plot the clusters. In the plot we see that the clusters seem to differ especially regarding the variables indus, nox, age, dis, rad, tax, ptratio, and lstat.
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the scaled Boston dataset with clusters
library(GGally)
ggpairs(boston_scaled, mapping = aes(col = as.factor(km$cluster), alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
## Warning in cor(x, y): the standard deviation is zero
In this chapter I use the "human" data; more information can be found at https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/datasets/human_meta.txt
The data includes several indicators for most countries in the world: the ratio of females to males with at least secondary education, the ratio of females to males in the labour force, expected years of schooling, life expectancy at birth, Gross National Income (GNI) per capita, maternal mortality ratio, adolescent birth rate, and the percentage of female representatives in parliament.
human <- read.csv("~/Desktop/IODS/R project for ODS/IODS-project/data/human.csv", row.names = 1)
summary(human)
## edu_ratio lab_ratio exp_edu exp_life
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI mat_mor ado_birth rep_parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
In the plots below we see that some of the variables are approximately normally distributed (e.g. expected years of schooling), while the distributions of other variables are heavily skewed. The ratio of females to males with at least secondary education, expected years of schooling, life expectancy at birth, and GNI per capita are all positively correlated with each other, and negatively correlated with the maternal mortality ratio and the adolescent birth rate. The ratio of females to males in the labour force and the percentage of female representatives in parliament show hardly any correlation with the other variables.
library(GGally)
library(dplyr)
library(corrplot)
ggpairs(human)
cor(human)%>%corrplot()
Based on the results below, principal component analysis does not seem to work well with unstandardized data. The first principal component captures almost all of the variance, and it is dominated by GNI, which has a far larger scale than the other variables.
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Next I standardize the data and redo the analysis to see if it helps. From the results below we see that the first principal component now captures about 54 percent of the variability and the second about 16 percent. The ratio of females to males with at least secondary education, expected years of schooling, life expectancy at birth, GNI per capita, maternal mortality ratio, and adolescent birth rate contribute mainly to the first principal component, while the ratio of females to males in the labour force and the percentage of female representatives in parliament contribute to the second principal component.
Based on these results and how the countries are situated in the biplot, the first principal component seems to capture mostly the wealth of the country. The second component captures some aspects of gender equality.
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
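The "Proportion of Variance" row of the summary is each component's squared standard deviation divided by their sum. A sketch using R's built-in iris measurements (not the human data):

```r
# proportion of variance explained by each principal component
pca <- prcomp(iris[, 1:4], scale. = TRUE)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var[1], 2)  # first component, about 0.73
```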
In this exercise I use the "tea" dataset. There are 36 variables in the dataset, so I will only keep variables "Tea", "How", "how", "sugar", "where", and "lunch".
library(FactoMineR)
library(tidyr)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
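As a side note, `one_of()` works here, but newer dplyr versions (1.0 and later) recommend `all_of()` when selecting columns by a character vector, because it errors clearly if a name is missing. A minimal equivalent of the selection above, assuming dplyr is loaded:

```r
library(dplyr)

# Same column selection with the newer tidy-select helper (dplyr >= 1.0):
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, all_of(keep_columns))
```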
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
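The warning above appears because `gather()` combines factor columns whose level attributes differ, and drops those attributes. A sketch of an alternative that avoids the warning, using the newer `pivot_longer()` (assuming tidyr >= 1.0) and converting the factors to character first:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

tea_time %>%
  mutate(across(everything(), as.character)) %>%   # unify column types before pivoting
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```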
The results of the multiple correspondence analysis (MCA) below show that the first dimension retains about 15 percent of the variance and the second about 14 percent. There are 11 dimensions in total, and all of them retain some of the variability. The biplot shows the variable categories drawn on the first two dimensions; categories of the same variable share a colour. The distance between variable categories is a measure of their similarity. For example, in the bottom right corner we can see that people who use unpackaged tea tend to buy it from a tea shop rather than a chain store.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")
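The per-dimension variances reported by `summary(mca)` can also be accessed directly from the result object, which is handy for plotting a scree plot or checking the cumulative share retained by the first two dimensions. A minimal sketch, assuming FactoMineR is loaded and `mca` is the object fitted above:

```r
library(FactoMineR)

# mca$eig is a matrix with one row per dimension:
# eigenvalue, percentage of variance, cumulative percentage of variance
eig <- mca$eig
round(head(eig), 3)

# Cumulative variance retained by the first two dimensions (~29 % here):
sum(eig[1:2, "percentage of variance"])
```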
(More chapters will be added in a similar fashion as the course proceeds!)